# Prepare photosynthesis pathway data from TRY for use

The photosynthetic pathway data from TRY indicate whether plants perform C3, C4, or CAM
photosynthesis. These pathways are ecologically relevant, because C4 plants can
assimilate and grow faster under hot and humid conditions, while CAM plants perform better than
others under dry conditions.

*If you intend to clean more than one or two traits, we recommend the use
of the batch pre-processing script. Refer to the [TRY main page](try-label) for details.*

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

## Requirements

To run the script, the following is needed:
- TRY data, available <a href="https://planthub.idiv.de/downloads/" target="_parent">here</a>
- the data.table package, which may need to be installed first

## Code

In [None]:
# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())


Let's get the TRY data

In [None]:
# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName %in% c("Photosynthesis pathway", "Leaf photosynthesis pathway")]


To get an overview of the data, we convert the values to lowercase, count how often each one occurs,
and show the counts sorted by frequency.

In [None]:
# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by counting each value and showing the counts in ascending order of frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
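
As a toy illustration of this step, the sketch below uses a made-up vector (`exampleVals` is hypothetical and not taken from TRY). It shows that `table()` lists the values alphabetically by default, while indexing with `order()` on the counts sorts them by frequency.

```r
# hypothetical raw entries, NOT taken from TRY
exampleVals <- c("C3", "c3", "C4", "CAM", "c3", "yes")

# lowercase so that "C3" and "c3" are counted together
exampleVals <- tolower(exampleVals)

# table() returns counts whose names are in alphabetical order
exampleOverview <- table(exampleVals)
names(exampleOverview)

# reorder by count (ascending), so rare and possibly odd entries come first
exampleOverview[order(exampleOverview)]
```

Sorting by frequency puts rare spellings and codes at the top, which is where entries that need manual attention tend to show up.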


Some entries are coded rather than spelled out. The value "yes" means different things depending on the
"DataName" column. We first inspect example entries for each "DataName" and then replace "yes" with the
pathway given after the colon in "DataName".

In [None]:
datNames <- names(table(TRYSubset[oriVals == "yes"]$DataName))
for (i in seq_along(datNames)) {
	print(datNames[i])
	print(TRYSubset[DataName == datNames[i] & oriVals == "yes"][1:2])
	print("--------------------------------------")
}
for (i in seq_along(datNames)) {
	# take the part after ": " in DataName and lowercase it to match the other values
	oriVals[TRYSubset$DataName == datNames[i] & oriVals == "yes"] <- tolower(sub(".*: ", "", datNames[i]))
}
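
To make the replacement pattern concrete, here is a minimal sketch on invented strings (the entries in `exampleDataNames` are assumptions for illustration; check the actual "DataName" values in your TRY release). It shows what `sub(".*: ", "", ...)` keeps, and why lowercasing the result matters for the lowercase search strings used later.

```r
# hypothetical DataName strings, made up for illustration
exampleDataNames <- c(
	"Photosynthetic pathway: C3",
	"Photosynthetic pathway: C4",
	"Photosynthetic pathway: CAM"
)

# ".*: " greedily matches everything up to and including the last ": "
stripped <- sub(".*: ", "", exampleDataNames)
stripped

# a lowercase pattern such as "c3" does not match the uppercase remainder ...
grepl("c3", stripped[1])

# ... but it does once the remainder is lowercased
grepl("c3", tolower(stripped[1]))
```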


The value "y" is only used to code C4 photosynthesis.

In [None]:
datNames <- names(table(TRYSubset[oriVals == "y"]$DataName))
for (i in seq_along(datNames)) {
	print(datNames[i])
	print(TRYSubset[DataName == datNames[i] & oriVals == "y"][1:2])
	print("--------------------------------------")
}
oriVals[TRYSubset$DataName == "Plant photosynthetic pathway" & oriVals == "y"] <- "c4"


Numeric values code the pathways: 1 stands for C3, 2 for C4, and 3 for CAM photosynthesis.

In [None]:
datNames <- names(table(TRYSubset[oriVals == "1"]$DataName))
for (i in seq_along(datNames)) {
	print(datNames[i])
	print(TRYSubset[DataName == datNames[i] & oriVals == "1"][1:2])
	print("--------------------------------------")
}
oriVals[TRYSubset$DataName == "Plant photosynthetic pathway" & oriVals == "1"] <- "c3"
oriVals[TRYSubset$DataName == "Plant photosynthetic pathway" & oriVals == "2"] <- "c4"
oriVals[TRYSubset$DataName == "Plant photosynthetic pathway" & oriVals == "3"] <- "cam"
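
The three assignments can also be written with a named lookup vector. The following is a sketch on made-up data, not a replacement for the code above; `exampleVals`, `exampleDataName`, and `codeMap` are hypothetical names.

```r
# hypothetical coded entries, made up for illustration
exampleVals <- c("1", "2", "3", "c3", "1")
exampleDataName <- rep("Plant photosynthetic pathway", length(exampleVals))

# named lookup vector: code -> pathway (same mapping as in this notebook)
codeMap <- c("1" = "c3", "2" = "c4", "3" = "cam")

# recode only entries that are known codes within the relevant DataName
isCoded <- exampleDataName == "Plant photosynthetic pathway" &
	exampleVals %in% names(codeMap)
exampleVals[isCoded] <- unname(codeMap[exampleVals[isCoded]])
exampleVals
```

A lookup vector keeps the mapping in one place, which can be easier to extend if further codes turn up.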


Having dealt with the coded entries, we remove entries with question marks, as they are unreliable.

In [None]:
oriVals[grepl("\\?", oriVals)] <- NA
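
The question mark is a quantifier in regular expressions, which is why it is escaped as "\\?" above. A small sketch on invented values (`exampleVals` is hypothetical) shows an equivalent way to match it literally with `fixed = TRUE`:

```r
# hypothetical entries, made up for illustration
exampleVals <- c("c3", "c4?", "cam", "?")

# fixed = TRUE treats the pattern as a literal string, so no escaping is needed
uncertain <- grepl("?", exampleVals, fixed = TRUE)

# both ways of matching flag the same entries
identical(uncertain, grepl("\\?", exampleVals))

# mark uncertain entries as missing
exampleVals[uncertain] <- NA
exampleVals
```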


The most important part of the cleaning process is defining the search strings to look for.
They are interpreted as regular expressions, which can make a search more inclusive (or exclusive).
For this trait, three plain strings are used.

In [None]:
searchNames <- c("c3", "c4", "cam")


We can now search for the strings defined before and give names to the new categories.

In [None]:
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)

# name the columns of the searchResults matrix after the search strings
colnames(searchResults) <- searchNames
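
To see what the resulting matrix looks like, here is a minimal sketch on made-up values (`exampleVals` and `exampleNames` are hypothetical). Each row is one entry, each column one search string, and an entry such as "c3/c4" matches two columns.

```r
# hypothetical cleaned entries, made up for illustration
exampleVals <- c("c3", "c4", "cam", "c3/c4", "unknown", NA)
exampleNames <- c("c3", "c4", "cam")

# one grepl() call per search string; the results are bound into a logical matrix
exampleResults <- sapply(exampleNames, grepl, exampleVals)
exampleResults

# number of matches per category
colSums(exampleResults)

# entries without any match, including the NA entry (grepl() returns FALSE for NA)
exampleVals[rowSums(exampleResults) < 1]
```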


Let's have a look at the results.

In [None]:
# show the number of matches to each category
colSums(searchResults)

# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]

# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)


As these categories should be exclusive, we exclude all ambiguous data
by setting our search results to FALSE whenever we found more than
one match in our search.

In [None]:
searchResults[rowSums(searchResults) > 1, ] <- FALSE
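
The effect of this step can be checked on a small hand-built matrix. This is a sketch with invented data; `exampleResults` is a hypothetical stand-in for `searchResults`.

```r
# hypothetical search results: row 2 matches both "c3" and "c4"
exampleResults <- matrix(
	c(TRUE, FALSE, FALSE,
		TRUE, TRUE, FALSE,
		FALSE, FALSE, TRUE),
	ncol = 3, byrow = TRUE,
	dimnames = list(NULL, c("c3", "c4", "cam"))
)

# rows with more than one match are ambiguous
ambiguous <- rowSums(exampleResults) > 1

# reset every column of the ambiguous rows to FALSE
exampleResults[ambiguous, ] <- FALSE

# every row now has at most one match
rowSums(exampleResults)
```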


Now, we can create new strings with the cleaned values and add them to the observations. To
keep the original entries, we store the cleaned values in a new column called "CleanedValueStr".

In [None]:
newVals <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[newVals == ""] <- NA

# integrate into TRY
TRY[TraitName %in% c("Photosynthesis pathway", "Leaf photosynthesis pathway"), CleanedValueStr := newVals]
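
Once ambiguous rows have been reset, every row has at most one match, so the cleaned values can also be derived without a row-wise loop. The sketch below uses invented data (`exampleResults`, `loopVals`, and `vecVals` are hypothetical names) to compare the row-wise approach with a vectorized alternative based on `max.col()`. It is one possible variant, not the method used in this notebook.

```r
# hypothetical search results with at most one TRUE per row
exampleResults <- matrix(
	c(TRUE, FALSE, FALSE,
		FALSE, FALSE, FALSE,
		FALSE, FALSE, TRUE),
	ncol = 3, byrow = TRUE,
	dimnames = list(NULL, c("c3", "c4", "cam"))
)

# row-wise approach: paste the names of all matching columns
loopVals <- sapply(seq_len(nrow(exampleResults)), function(x) {
	paste(colnames(exampleResults)[exampleResults[x, ]], collapse = ",")
})
loopVals[loopVals == ""] <- NA

# vectorized alternative, valid only if each row has at most one TRUE
hit <- rowSums(exampleResults) == 1
vecVals <- rep(NA_character_, nrow(exampleResults))
# multiply by 1 to turn the logical matrix into a numeric one for max.col()
hitCols <- max.col(exampleResults[hit, , drop = FALSE] * 1, ties.method = "first")
vecVals[hit] <- colnames(exampleResults)[hitCols]

# both approaches give the same result
identical(loopVals, vecVals)
```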


Although not necessary, we may change the trait name.

In [None]:
TRY[
	TraitName %in% c("Photosynthesis pathway", "Leaf photosynthesis pathway"),
	TraitName := "Leaf photosynthesis pathway"
]


Let's write the data to a file.

In [None]:
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))
